Abstract of JP 10214258 (A)

PROBLEM TO BE SOLVED: To provide a data
processing system in which hardware resources
can be utilized efficiently and spatial constraints can
be reduced. SOLUTION: Image recognition and
synthesis devices and speech recognition and synthesis
devices 40-46, which need high-speed processing
or a large storage capacity, are provided only
on the server 102 side. Input and output of
images and speech are performed on the client 100
side. Image data or speech data input on the
client 100 side is transferred through a network
104 to the server 102 side and recognized on the
server side. Then the recognized image data and
speech data, or image data and speech data
synthesized on the server side, are transferred
through the network 104 to the client side and
displayed or reproduced on the client side.
* NOTICES *

JPO and INPIT are not responsible for any 
damages caused by the use of this translation. 

1. This document has been translated by computer. So the translation may not reflect the original
precisely.

2. **** shows the word which can not be translated.
3. In the drawings, any words are not translated.



CLAIMS 

[Claim(s)] 

[Claim 1]A data processing system in which a client and a server are connected by a network, wherein
the client has: an image input means which inputs a picture and obtains image data; a voice input
means which inputs a sound and obtains voice data; a picture output means which displays a picture
based on image data; a voice output means which reproduces a sound based on voice data; a network
interface which makes connection with the network; and a client control means which controls the
operation of transmitting the image data and voice data obtained by said image input means and said
voice input means, respectively, to the server side, and of supplying image data and voice data
transmitted from the server side to said picture output means and said voice output means,
respectively; and the server has: an image recognition means which recognizes image data transmitted
from the client side; a voice recognition means which recognizes voice data transmitted from the
client side; an image compositing means which synthesizes image data; a voice synthesis means which
synthesizes voice data; a network interface which makes connection with the network; and a server
control means which controls the operation of supplying the image data and voice data transmitted
from the client side to said image recognition means and said voice recognition means, respectively,
and of transmitting the image data and voice data synthesized by said image compositing means and
said voice synthesis means, respectively, to the client side.
[Claim 2]The data processing system according to claim 1, wherein two or more said clients are
connected to said network, and the system has a client identification means by which said server
identifies the destination client for data transfer.

[Claim 3]The data processing system according to claim 1 or 2, wherein two or more said servers are
connected to said network, and the system has a server selecting means by which a client chooses
the destination server for data transfer according to the processing situation of said servers.

[Claim 4]The data processing system according to any one of claims 1 to 3, provided with a sensor
means for detecting an object whose picture and sound are to be processed.




DETAILED DESCRIPTION 



[Detailed Description of the Invention] 
[0001] 

[Field of the Invention]This invention relates to a data processing system which performs requested
processing on image data and voice data.

[0002] 

[Background of the Invention]As a data processing system which performs various kinds of processing,
such as recognition and synthesis, on image data or voice data, there are systems such as those
shown in drawing 3 and drawing 4, for example. In this example, a specific person registered in
advance appears in front of the system, the system inputs that person's voice and face and
identifies the person, and it then reproduces and outputs a registered message and picture. An
example of the composition of the system is shown in drawing 3, and an arrangement using two or more
such systems is shown in drawing 4. In these figures, detection of a person's arrival etc. is
performed by the sensor 10, such as an infrared sensor, a vibration sensor, or a sound sensor, and
the sensor reader 12, and the sensor read station 14 is constituted by these.

[0003]Input of a person's image and its conversion to digital data are performed by the camera 16 and
the picture input device 18, and the image input part 20 is constituted by these. Input of a person's
voice and its conversion to digital data are performed by the microphone 22 and the speech input
system 24, and the voice input part 26 is constituted by these. Display of a picture based on
digitized image data is performed by the display device 28 and the image output device 30, and the
image output part 32 is constituted by these. Voice output based on digitized voice data is
performed by the loudspeaker 34 and the speech output unit 36, and the voice output part 38 is
constituted by these.

[0004]The person's face image and voice data obtained by these devices are supplied to the image
recognition device 40 or the voice recognition equipment 42, where analysis processing for
recognition is performed. When there are image data and voice data to be synthesized, a picture and
a sound are synthesized from them by the image compositing device 44 or the voice synthesizer 46.
Examples of image recognition include identification of a person and analysis of motion. Examples
of speech recognition include identification of a specific speaker and analysis of conversation.
Examples of picture synthesis include rendering of a three-dimensional picture and generation of
video data. Examples of voice synthesis include synthesis of speech in an arbitrary tone of voice.

[0005]The synthesized picture is supplied to the image output device 30 and displayed
on the display device 28. The synthesized sound is supplied to the speech output unit 36 and
reproduced by the loudspeaker 34. These processes are controlled by the control device
48.

[0006]The above input/output devices and recognition and synthesis devices for pictures and sounds
are provided in every system, as shown in drawing 4. That is, sensor reading, input of a picture or
a sound, recognition and synthesis processing, and output are performed independently in each
system.
[0007] 

[Problem to be solved by the invention]In general, recognition of pictures or of speech requires
that a vast quantity of data be processed at high speed or stored. For this reason, a high-speed
CPU, a DSP for data processing, dedicated hardware, mass memory storage, etc. are needed. A
high-speed CPU, a DSP for high-speed data processing, and other dedicated hardware are likewise
needed for picture synthesis, which generates the picture to be displayed, and for voice synthesis,
which generates a sound. Recognition and synthesis of pictures and sounds depend on such hardware.

[0008]However, high-speed CPUs, DSPs for data processing, dedicated hardware, and mass memory
storage are all expensive, and they complicate the equipment configuration and raise the cost of a
system. Therefore, if each system is independently provided with recognition and synthesis devices
for pictures and sounds, as in the background art mentioned above, each system needs to be built
from a high-performance computer such as a workstation, and the cost of each system becomes very
high. In particular, as shown in drawing 4, when two or more systems with the same function are
prepared, the cost rises in proportion to their number because each terminal is expensive, so
preparing many systems requires a huge cost. Nor is this desirable from the standpoint of
miniaturizing the system and saving space.

[0009]This invention was made in view of the above points, and its purpose is to provide a low-cost
data processing system which can utilize hardware resources efficiently. Another purpose
is to provide a data processing system which can reduce spatial restrictions.
[0010] 

[Means for solving problem]In order to attain said purpose, in this invention the client (100) side
has: an image input means (16, 18) which inputs a picture and obtains image data; a voice input
means (22, 24) which inputs a sound and obtains voice data; a picture output means (28, 30) which
displays a picture based on image data; a voice output means (34, 36) which reproduces a sound
based on voice data; a network interface (108) which makes connection with a network (104); and a
client control means (106) which controls the operation of transmitting the image data and voice
data obtained by said image input means and said voice input means, respectively, to the server
side, and of supplying image data and voice data transmitted from the server side to said picture
output means and said voice output means, respectively.

[0011]The server (102) side has: an image recognition means (40) which recognizes image data
transmitted from the client side; a voice recognition means (42) which recognizes voice data
transmitted from the client side; an image compositing means (44) which synthesizes image data; a
voice synthesis means (46) which synthesizes voice data; a network interface (112) which makes
connection with the network; and a server control means (110) which controls the operation of
supplying the image data and voice data transmitted from the client side to said image recognition
means and said voice recognition means, respectively, and of transmitting the image data and voice
data synthesized by said image compositing means and said voice synthesis means, respectively, to
the client side.

[0012]The client and the server are connected by the network.

[0013]According to one main form, two or more said clients are connected to said network, and the
system has a client identification means (200, 202) by which said server identifies the destination
client for data transfer. Alternatively, two or more said servers are connected to said network, and
the system has a server selecting means (300, 302) by which a client chooses the destination server
for data transfer according to the processing situation of said servers. In a further form, the
system has a sensor means (10, 12) for detecting an object whose picture and sound are to be
processed.

[0014]In another main form, the client is provided with: a speech input system which converts input
speech into digitized input voice data; a picture input device which converts an input picture into
digitized input image data; a speech output unit which reproduces digitized output voice data; an
image output device which displays digitized output image data; a sensor which detects the presence
of a person; a network interface which makes connection with a network; and a client control device
which controls operation of the client. The server is provided with: voice recognition equipment
which performs analysis and recognition of the input voice data transmitted via the network; an
image recognition device which performs analysis and recognition of the input image data transmitted
via the network; a voice synthesizer which synthesizes output voice data; an image compositing
device which synthesizes output image data; a server-side network interface which exchanges the
input voice data, input image data, output voice data, and output image data with said client via
the network; and a server control device which controls operation of the server. The client and the
server are connected by the network.

[0015]According to this invention, input and output of a picture or a sound are performed on the
client side. Image data and voice data input on the client side are transmitted to the server
side through the network, and recognition processing is carried out on the server side. The image
data and voice data after recognition processing, or image data and voice data synthesized on the
server side, are then transmitted to the client side through the network and displayed or
reproduced on the client side. Since equipment which needs high-speed processing and a large
storage capacity, such as that for image recognition and synthesis and for speech recognition and
synthesis, has to be provided only on the server side, hardware resources can be utilized
efficiently. The above and other purposes, features, and advantages of this invention will become
clear from the following detailed explanation and the accompanying drawings.
[0016] 

[Mode for carrying out the invention]Hereafter, embodiments of the invention are described in detail
with reference to working examples.

[0017] 

[Work example 1]First, Embodiment 1 is described with reference to drawing 1 and drawing 2 (A). The
same reference numerals are used for elements corresponding to the background art mentioned above.
As the whole is shown in drawing 2 (A), this system has a composition in which the server 102 is
connected to the client (terminal) 100 by the network 104. The detailed composition of each part is
shown in drawing 1. In that figure, the operation of the sensor reader 12, the picture input device
18, the speech input system 24, the image output device 30, and the speech output unit 36 mentioned
above is controlled by the client control device 106. Each of the above devices is provided in the
client 100. The client 100 is connected to the network 104 by the network interface 108.

[0018]On the other hand, the image recognition device 40, the voice recognition equipment 42, the
image compositing device 44, and the voice synthesizer 46 mentioned above are all provided in the
server 102. The operation of each of these devices is controlled by the server control device 110.
The server 102 is connected to the network 104 by the network interface 112.

[0019]As mentioned above, the input/output section for pictures and sounds is provided on the client
100 side, and the recognition and synthesis devices for pictures and sounds are provided on the
server 102 side. The client 100 and the server 102 are connected by the network 104.
[0020]Next, the overall operation is explained. As in the background art mentioned above, the case
explained here is one in which a specific person registered in advance appears before the client
100, a picture of the person's face and the person's voice are input, the person is identified by
the server 102, and a registered picture and voice message are reproduced and output by the client
100.

[0021]The approach of a person to the client 100 is detected by the sensor 10. Then the person's
face is photographed by the camera 16 of the image input part 20 and the picture input device 18
and captured as image data. When the person speaks, the sound is input by the microphone 22 of the
voice input part 26 and the speech input system 24 and captured as voice data. The captured image
data and voice data are supplied by the client control device 106 to the network 104 through the
network interface 108 and transmitted to the server 102 side.
[0022]In the server 102, the data transmitted from the client 100 through the network 104 is
received by the network interface 112. Of the received data, the server control device 110 supplies
the image data to the image recognition device 40 and the voice data to the voice recognition
equipment 42.
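The routing step in [0022] can be sketched in Python as follows. This is only an illustrative sketch: the packet layout (a dict with optional "image" and "voice" entries) and the function name `route_to_recognizers` are assumptions made for the example and do not appear in the original.

```python
def route_to_recognizers(packet, recognize_image, recognize_voice):
    """Hand each kind of data in a received packet to its recognizer.

    `packet` is a hypothetical dict with optional "image" and "voice" entries;
    the two recognizers are passed in as plain callables.
    """
    results = {}
    if packet.get("image") is not None:
        results["image"] = recognize_image(packet["image"])
    if packet.get("voice") is not None:
        results["voice"] = recognize_voice(packet["voice"])
    return results
```

Passing the recognizers in as callables mirrors the description, in which the server control device only distributes data and the recognition itself is done by separate devices.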

[0023]In the image recognition device 40, the image data is analyzed based on an image recognition
algorithm prepared beforehand. For example, the analysis is conducted by techniques such as
"feature extraction matching" and "pattern matching." The analysis result is compared with face
image data of persons registered beforehand in memory storage (not shown) on the server 102 side,
and the applicable person is identified by choosing the person whose data matches or is the closest
approximation.
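The final step of [0023], choosing the registered person whose data matches or is closest, can be illustrated with a nearest-match search. The sketch below assumes that the analysis has already reduced each face to a numeric feature vector and uses Euclidean distance; the feature representation, the distance measure, the optional rejection threshold, and all names are assumptions for illustration, since the original does not specify them.

```python
import math

def identify_person(features, registered, max_distance=None):
    """Return the registered name whose feature vector is closest to `features`.

    `registered` maps a name to a reference feature vector (hypothetical layout).
    If `max_distance` is given and even the best match is farther than that,
    None is returned, meaning no registered person matched.
    """
    best_name, best_dist = None, math.inf
    for name, reference in registered.items():
        dist = math.dist(features, reference)  # Euclidean distance
        if dist < best_dist:
            best_name, best_dist = name, dist
    if max_distance is not None and best_dist > max_distance:
        return None
    return best_name

# Example with made-up three-element feature vectors.
registered_faces = {"alice": (0.1, 0.9, 0.3), "bob": (0.8, 0.2, 0.5)}
match = identify_person((0.15, 0.85, 0.35), registered_faces)
```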

[0024]On the other hand, in the voice recognition equipment 42, the voice data is analyzed based on
a voice recognition algorithm prepared beforehand. For example, the analysis is conducted by
techniques such as "DP (Dynamic Programming) matching" and "HMM (Hidden Markov Model)." The
analysis result is compared with voice data, such as words and idioms, registered beforehand in
memory storage, and the applicable word is identified by choosing the entry whose data matches or
is the closest approximation.
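As an illustration of the "DP matching" named in [0024], the sketch below uses dynamic time warping, a common form of dynamic-programming matching, to compare an utterance with registered word templates and pick the closest one. It assumes that each utterance has been reduced to a one-dimensional sequence of numbers; real systems compare sequences of feature vectors, and the function names and sample data are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch b
                                 cost[i][j - 1],      # stretch a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

def recognize_word(utterance, templates):
    """Return the registered word whose template is closest to the utterance."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))

# Made-up templates; the utterance is a time-stretched version of "hello".
templates = {"hello": [1, 3, 5, 3, 1], "bye": [5, 4, 3, 2, 1]}
word = recognize_word([1, 1, 3, 5, 5, 3, 1], templates)
```

The warping step is what lets the same word spoken faster or slower still match its template, which is why this kind of matching suits speech.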
[0025]The above recognition results are transmitted by the server control device 110 to the client
100 side via the network interface 112 and the network 104. The client control device 106 receives
the image recognition result and the speech recognition result sent through the network 104 via the
network interface 108. Then, when there are image data and voice data to be synthesized, the client
control device 106 transmits them to the server 102 side via the network interface 108 and the
network 104. The image data to be synthesized is not actual image data but, for example, number
data indicating a picture decided beforehand. Likewise, the voice data to be synthesized is not
data obtained by digitizing an actual voice waveform but character string information of a text, or
the message number of a fixed-form message decided beforehand.
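The compact synthesis request described in [0025] might be represented as below. This is a sketch under assumptions: the field names, the `FIXED_MESSAGES` table, and its contents are invented for illustration; the original says only that a picture number, a text string, or a fixed-form message number is sent instead of raw image or waveform data.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SynthesisRequest:
    """What the client sends when it wants a picture or speech synthesized."""
    picture_number: Optional[int] = None   # index of a picture decided beforehand
    message_number: Optional[int] = None   # index of a fixed-form message
    text: Optional[str] = None             # free text to be spoken

# Hypothetical table of fixed-form messages held on the server side.
FIXED_MESSAGES = {1: "Welcome back.", 2: "Please wait a moment."}

def resolve_speech_text(request):
    """Return the text the voice synthesizer should speak, or None."""
    if request.text is not None:
        return request.text
    if request.message_number is not None:
        return FIXED_MESSAGES[request.message_number]
    return None
```

Sending a number or a short string rather than a waveform keeps the client-to-server traffic small, which fits the description's aim of keeping the client simple.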

[0026]The server control device 110 receives, via the network interface 112, the image data and
voice data to be synthesized which were sent through the network 104. The server control device 110
then supplies the image data to be synthesized to the image compositing device 44 and the voice
data to be synthesized to the voice synthesizer 46.
[0027]In the image compositing device 44, actual composite image data is generated based on the
input image data to be synthesized. For example, when an animation is required, video data is
generated from two or more pictures. When synthesizing a picture, an algorithm such as "rendering"
may be used to draw it on the spot. Alternatively, image data and video data accumulated in memory
storage provided on the server 102 side may simply be read out. In the voice synthesizer 46, actual
synthetic voice data (data point) is generated based on the input voice data to be synthesized. For
example, a sound is synthesized by algorithms such as "analysis-synthesis" and "synthesis by rule."
It may also suffice simply to read out voice data accumulated in memory storage.

[0028]The data synthesized by the image compositing device 44 and the voice synthesizer 46 is sent
by the server control device 110 to the client 100 side via the network interface 112 and the
network 104. On the client 100 side, the transmitted composite image data is supplied to the image
output device 30 by the client control device 106, and a picture based on the composite image data
is displayed on the display device 28 by the image output device 30. The transmitted synthetic
voice data is supplied to the speech output unit 36 by the client control device 106, and a sound
based on the synthetic voice data is output from the loudspeaker 34 by the speech output unit 36.

[0029]As mentioned above, according to Embodiment 1, the input/output devices for pictures and
sounds are provided on the client 100 side, and the recognition and synthesis devices for pictures
and sounds are provided on the server 102 side. In the client 100, only input and output of
pictures and sounds are performed, and picture or sound data is transmitted to the server 102 side
through the network 104. Recognition and synthesis of pictures and sounds, which require high-speed
processing, are performed on the server 102 using accumulated data, and the processing result is
transmitted to the client 100 through the network 104.

[0030]For this reason, viewed as a whole system, the client and the server can be arranged
separately, which eases restrictions on installation space. Since a sensor for detecting the
presence of a person is provided on the client side in this example, there is also the advantage
that erroneous decisions based on a picture or a sound can be avoided.
[0031] 

[Work example 2]Next, Embodiment 2 is described with reference to drawing 2 (B). In this example,
the clients 100A and 100B are each connected to the server 102 through the network 104. In the
clients 100A and 100B, only input and output of pictures and sounds are performed, as in Embodiment
1, and recognition and synthesis of these pictures and sounds are performed by the server 102, as
in Embodiment 1. That is, a single server processes the image data and voice data transmitted
through the network from two or more clients.

[0032]Data from different clients can be distinguished by adding identification data that identifies
the client to the transmitted data. For example, the identification data adjunct 200 is provided in
each of the clients 100A and 100B, and the client identification part 202 is provided in the server
102. In the client 100A or 100B that originates the data, identification data is added to the
transmitted data by the identification data adjunct 200, and the data is transmitted to the server
102 side. On the server 102 side, the identification data is stored in the client identification
part 202 (or the server control device 110). When the processing result for the transmitted data is
returned to the client side, the applicable client is identified by the client identification part
202 with reference to the stored identification data, and the processing result is transmitted to
it. In this way, the system can respond to access from two or more clients.
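The identification scheme of [0032] can be sketched as follows: the client attaches its identifier to each transmission, the server remembers which client each job came from, and the result is addressed back to that client. The packet layout, the job numbering, and the class and function names are assumptions for illustration only.

```python
def add_identification(client_id, payload):
    """Client side: attach identification data to outgoing data (cf. adjunct 200)."""
    return {"client_id": client_id, "payload": payload}

class ClientIdentifier:
    """Server side: remember the sender of each job (cf. identification part 202)."""

    def __init__(self):
        self._pending = {}   # job id -> client id
        self._next_job = 0

    def receive(self, packet):
        """Store the sender's identification data and return a job id."""
        job_id = self._next_job
        self._next_job += 1
        self._pending[job_id] = packet["client_id"]
        return job_id

    def address_result(self, job_id, result):
        """Look up the stored identification data and address the result."""
        client_id = self._pending.pop(job_id)
        return {"to": client_id, "result": result}
```

Because the sender is looked up per job, results can be returned correctly even when requests from several clients finish out of order.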

[0033]Thus, according to this example, each client performs only input and output of pictures and
sounds, and picture or sound data is transmitted to the server side through the network.
Recognition and synthesis of pictures and sounds, which demand high-speed processing, and
accumulation of data are performed on a server shared by two or more clients, and the picture and
sound processing results are transmitted to the corresponding client. For this reason, the
high-performance CPU, the DSP, and the mass memory are shared by the clients, and cost can be
reduced as a whole. Moreover, since two or more clients are connected to a server through a
network, a client that is added when needed only has to be connected to the network, so hardware
resources can be utilized efficiently.
[0034] 

[Work example 3]Next, Embodiment 3 is described with reference to drawing 2 (C). In this example,
the servers 102A and 102B are each connected to the client 100 through the network 104. In the
client 100, only input and output of pictures and sounds are performed, as in Embodiment 1, and
recognition and synthesis of these pictures and sounds are performed by the servers 102A and 102B,
as in Embodiment 1. That is, processing of the image data and voice data transmitted through the
network from the client is distributed between two servers.

[0035]Data processing can be distributed between the servers by having each server report its
processing situation, i.e., its load and operating status, to the client side, and having the
client choose a lightly loaded server and transmit the data to it. For example, the processing
situation report section 300 is provided in each of the servers 102A and 102B, and the processing
situation judgment part 302 is provided in the client 100. Before data transfer, the client 100
that originates the data first receives a report of the processing situation of each of the servers
102A and 102B from the processing situation report section 300. In the client 100, the processing
situation of each of the servers 102A and 102B is examined by the processing situation judgment
part 302, and the lightly loaded server is determined. That server is then chosen, the data is
transmitted to it, and recognition and synthesis are requested.
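The selection rule of [0035] can be sketched as choosing, among the servers that report themselves as operating, the one with the lightest load. The report format (a "load" value and an "available" flag per server) and the function name are assumptions for illustration; the original does not say how load is measured or reported.

```python
def select_server(reports):
    """Client side: pick the lightest-loaded available server (cf. judgment part 302).

    `reports` maps a server name to a hypothetical status report of the form
    {"load": <number>, "available": <bool>}. Returns None if no server is usable.
    """
    loads = {name: r["load"] for name, r in reports.items() if r["available"]}
    if not loads:
        return None
    return min(loads, key=loads.get)

# Made-up reports from the two servers of Embodiment 3.
reports = {
    "102A": {"load": 0.7, "available": True},
    "102B": {"load": 0.2, "available": True},
}
chosen = select_server(reports)
```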

[0036]Thus, according to this example, although the number of servers is increased, processing can
be distributed among two or more servers and the burden on each server can be eased. For this
reason, even hardware resources of lower capability can be utilized effectively, and viewed as a
whole, cost can be reduced. In this example, if part of the hardware, for example the memory which
accumulates data, is shared by the servers, resources can be used still more effectively.
[0037] 

[Work example 4]Next, Embodiment 4 is described with reference to drawing 2 (D). This example
combines Embodiment 2 and Embodiment 3 mentioned above. That is, the clients 100C, 100D, and 100E
and the servers 102C and 102D are connected to the network 104. The identification data adjunct 200
and the processing situation judgment part 302 are provided in each of the clients 100C, 100D, and
100E. The client identification part 202 and the processing situation report section 300 are
provided in each of the servers 102C and 102D.
[0038]The servers 102C and 102D report their processing situation to the clients 100C, 100D, and
100E. Each of the clients 100C, 100D, and 100E chooses a lightly loaded server according to the
processing situation on the server side, adds its own identification data to the data to be
processed, and transmits it. The server that receives the data performs the processing and
transmits the processed data to the applicable client.

[0039]Thus, according to this example, two or more servers and two or more clients are provided on
a network, and each client chooses a lightly loaded server from among the servers and requests
processing. Since processing is distributed on the server side, the system can adapt flexibly even
to a large-scale network, and hardware resources are utilized still more effectively.
[0040] 

[Other Example(s)]This invention has many possible embodiments and can be modified in various ways
based on the above disclosure. For example, the following are also included.
[0041](1) Various kinds of elements are known for each component of the system, such as the camera
and the display device, and any of them may be used. For example, an infrared sensor, a vibration
sensor, a sound sensor, etc. can be used as the sensor 10. The recognition and synthesis algorithms
for pictures and sounds are likewise not limited to those of the above-mentioned embodiments, and
various techniques may be used. Processing other than recognition or synthesis may also be
performed. The number of clients or servers connected to a network may also be increased or
decreased as needed.

[0042](2) Although the above embodiments were explained using the example of a system which detects
the approach of a person and recognizes the person's face and voice, the invention is applicable to
any system which performs some kind of processing on pictures and sounds.
[0043] 

[Effect of the Invention]As explained above, this invention has the following effects.

(1) Since the system is separated into a client, which performs input and output of pictures and
sounds, and a server, which performs recognition and synthesis of voice data or image data, spatial
restrictions are reduced.

(2) Since two or more clients share a server through a network, hardware resources can be used
effectively and cost is reduced.

(3) Since processing is distributed among two or more servers, the burden on each server is
reduced and hardware of lower capability can be utilized effectively.




DESCRIPTION OF DRAWINGS 

[Brief Description of the Drawings] 

[Drawing 1] It is a block diagram showing the composition of Embodiment 1 of this invention. 

[Drawing 2] It is a block diagram showing the main composition of the embodiment of this invention. 

[Drawing 3] It is a block diagram showing an example of the conventional system.

[Drawing 4] It is a block diagram showing an example using two or more of the background-art
systems of drawing 3.

[Explanations of letters or numerals] 

10 — Sensor 

12 — Sensor reader 

14 — Sensor read station 

16 — Camera 

18 — Picture input device 

20 — Image input part 

22 — Microphone 

24 — Speech input system 

26 — Voice input part 

28 — Display device 

30 — Image output device 

32 — Image output part 

34 — Loudspeaker 

36 — Speech output unit 

38 — Voice output part 

40 — Image recognition device 

42 — Voice recognition equipment 

44 — Image compositing device 

46 — Voice synthesizer 

48,106,110 — Control device 

100 — Client

102 — Server

104 — Network

108,112 — Network interface 

200 — Identification data adjunct 

202 — Client identification part 

300 — Processing situation report section 

302 — Processing situation judgment part 